distance profile
Castor: Competing shapelets for fast and accurate time series classification
Shapelets are discriminative subsequences, originally embedded in shapelet-based decision trees but have since been extended to shapelet-based transformations. We propose Castor, a simple, efficient, and accurate time series classification algorithm that utilizes shapelets to transform time series. The transformation organizes shapelets into groups with varying dilation and allows the shapelets to compete over the time context to construct a diverse feature representation. By organizing the shapelets into groups, we enable the transformation to transition between levels of competition, resulting in methods that more closely resemble distance-based transformations or dictionary-based transformations. We demonstrate, through an extensive empirical investigation, that Castor yields transformations that result in classifiers that are significantly more accurate than several state-of-the-art classifiers. In an extensive ablation study, we examine the effect of choosing hyperparameters and suggest accurate and efficient default values.
Matching via Distance Profiles
In this paper, we introduce and study matching methods based on distance profiles. For the matching of point clouds, the proposed method is easily implementable by solving a linear program, circumventing the computational obstacles of quadratic matching. Also, we propose and analyze a flexible way to execute location-to-location matching using distance profiles. Moreover, we provide a statistical estimation error analysis in the context of location-to-location matching using empirical process theory. Furthermore, we apply our method to a certain model and show its noise stability by characterizing conditions on the noise level for the matching to be successful. Lastly, we demonstrate the performance of the proposed method and compare it with some existing methods using synthetic and real data.
Error-bounded Approximate Time Series Joins Using Compact Dictionary Representations of Time Series
Yeh, Chin-Chia Michael, Zheng, Yan, Wang, Junpeng, Chen, Huiyuan, Zhuang, Zhongfang, Zhang, Wei, Keogh, Eamonn
The matrix profile is an effective data mining tool that provides similarity join functionality for time series data. Users of the matrix profile can either join a time series with itself using intra-similarity join (i.e., self-join) or join a time series with another time series using inter-similarity join. By invoking either or both types of joins, the matrix profile can help users discover both conserved and anomalous structures in the data. Since the introduction of the matrix profile five years ago, multiple efforts have been made to speed up the computation with approximate joins; however, the majority of these efforts only focus on self-joins. In this work, we show that it is possible to efficiently perform approximate inter-time series similarity joins with error bounded guarantees by creating a compact "dictionary" representation of time series. Using the dictionary representation instead of the original time series, we are able to improve the throughput of an anomaly mining system by at least 20X, with essentially no decrease in accuracy. As a side effect, the dictionaries also summarize the time series in a semantically meaningful way and can provide intuitive and actionable insights. We demonstrate the utility of our dictionary-based inter-time series similarity joins on domains as diverse as medicine and transportation.
Three iterations of $(d-1)$-WL test distinguish non isometric clouds of $d$-dimensional points
Rose, Valentino Delle, Kozachinskiy, Alexander, Rojas, Cristรณbal, Petrache, Mircea, Barcelรณ, Pablo
The Weisfeiler--Lehman (WL) test is a fundamental iterative algorithm for checking isomorphism of graphs. It has also been observed that it underlies the design of several graph neural network architectures, whose capabilities and performance can be understood in terms of the expressive power of this test. Motivated by recent developments in machine learning applications to datasets involving three-dimensional objects, we study when the WL test is {\em complete} for clouds of euclidean points represented by complete distance graphs, i.e., when it can distinguish, up to isometry, any arbitrary such cloud. Our main result states that the $(d-1)$-dimensional WL test is complete for point clouds in $d$-dimensional Euclidean space, for any $d\ge 2$, and that only three iterations of the test suffice. Our result is tight for $d = 2, 3$. We also observe that the $d$-dimensional WL test only requires one iteration to achieve completeness.
Matrix Profile XXVII: A Novel Distance Measure for Comparing Long Time Series
Der, Audrey, Yeh, Chin-Chia Michael, Wu, Renjie, Wang, Junpeng, Zheng, Yan, Zhuang, Zhongfang, Wang, Liang, Zhang, Wei, Keogh, Eamonn
The most useful data mining primitives are distance measures. With an effective distance measure, it is possible to perform classification, clustering, anomaly detection, segmentation, etc. For single-event time series Euclidean Distance and Dynamic Time Warping distance are known to be extremely effective. However, for time series containing cyclical behaviors, the semantic meaningfulness of such comparisons is less clear. For example, on two separate days the telemetry from an athlete workout routine might be very similar. The second day may change the order in of performing push-ups and squats, adding repetitions of pull-ups, or completely omitting dumbbell curls. Any of these minor changes would defeat existing time series distance measures. Some bag-of-features methods have been proposed to address this problem, but we argue that in many cases, similarity is intimately tied to the shapes of subsequences within these longer time series. In such cases, summative features will lack discrimination ability. In this work we introduce PRCIS, which stands for Pattern Representation Comparison in Series. PRCIS is a distance measure for long time series, which exploits recent progress in our ability to summarize time series with dictionaries. We will demonstrate the utility of our ideas on diverse tasks and datasets.
Measuring Similarity of Interactive Driving Behaviors Using Matrix Profile
Lin, Qin, Wang, Wenshuo, Zhang, Yihuan, Dolan, John
-- Understanding multi-vehicle interactive behaviors with temporal sequential observations is crucial for autonomous vehicles to make appropriate decisions in an uncertain traffic environment. On-demand similarity measures are significant for autonomous vehicles to deal with massive interactive driving behaviors by clustering and classifying diverse scenarios. This paper proposes a general approach for measuring spatiotemporal similarity of interactive behaviors using a multivariate matrix profile technique. The key attractive features of the approach are its superior space and time complexity, real-time online computing for streaming traffic data, and possible capability of leveraging hardware for parallel computation. The proposed approach is validated through automatically discovering similar interactive driving behaviors at intersections from sequential data. One of the biggest challenges for deploying autonomous vehicles (A Vs) in real life is the requirement of the A Vs' capability to interact with surrounding road users. Classifying diverse scenarios and separately designing appropriate decisions using on-hand prior knowledge is unfortunately not realistic [1] because of the diversity of scenarios that are far larger and messier than human beings can cope with [2].
Towards a Near Universal Time Series Data Mining Tool: Introducing the Matrix Profile
Towards a Near Universal Time Series Data Mining Tool: Introducing the Matrix Profile by Chin-Chia Michael Yeh Doctor of Philosophy, Graduate Program in Computer Science University of California, Riverside, September 2018 Dr. Eamonn Keogh, Chairperson The last decade has seen a flurry of research on all-pairs-similarity-search (or, self-join) for text, DNA, and a handful of other datatypes, and these systems have been applied to many diverse data mining problems. Surprisingly, however, little progress has been made on addressing this problem for time series subsequences. In this thesis, we have introduced a near universal time series data mining tool called matrix profile which solves the all-pairssimilarity-search problem and caches the output in an easy-to-access fashion. The proposed algorithm is not only parameter-free, exact and scalable, but also applicable for both single and multidimensional time series. By building time series data mining methods on top of matrix profile, many time series data mining tasks (e.g., motif discovery, discord discovery, shapelet discovery, semantic segmentation, and clustering) can be efficiently solved. Because the same matrix profile can be shared by a diverse set of time series data mining methods, matrix profile is versatile and computed-once-use-many-times data structure. We demonstrate the utility of matrix profile for many time series data mining problems, including motif discovery, discord discovery, weakly labeled time series classification, and vi representation learning on domains as diverse as seismology, entomology, music processing, bioinformatics, human activity monitoring, electrical power-demand monitoring, and medicine. We hope the matrix profile is not the end but the beginning of many more time series data mining projects.